Word Transformation Heuristics Against Lexicons for Cognate Detection
Abstract
One of the most common lexical transformations between French and English cognates is the presence or absence of a terminal “e”. Many other transformations exist, such as a French vowel with a circumflex corresponding to the same vowel followed by the letter “s” in English. We tested the effectiveness of taking the entire English and French lexicons from TreeTagger, deaccenting the French lexicon, and intersecting the two. Words shorter than 6 letters were excluded from the resulting list, and a set of lexical transformations was applied before intersecting to enlarge the pool of potential cognates. The result was 15% above the baseline cognate list on the initial test set, but only 1% above it on the final test set. Its accuracy, however, was consistent at about 37% on both test sets.
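The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the exact transformation set (only the terminal "e" and circumflex rules named above), and the choice to apply the 6-letter cutoff to the shared English form are all assumptions, and the lexicons are passed in as plain word lists rather than read from TreeTagger files.

```python
import unicodedata

MIN_LENGTH = 6  # the abstract excludes words shorter than 6 letters

# Assumed rule: a circumflexed French vowel may correspond to vowel + "s".
CIRCUMFLEX_TO_S = {"â": "as", "ê": "es", "î": "is", "ô": "os", "û": "us"}


def deaccent(word):
    """Strip diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def french_variants(word):
    """Return deaccented spellings of a French word after each assumed rule."""
    word = unicodedata.normalize("NFC", word.lower())
    variants = {deaccent(word)}
    # Circumflex rule, applied before deaccenting so the accent is still visible.
    expanded = "".join(CIRCUMFLEX_TO_S.get(ch, ch) for ch in word)
    variants.add(deaccent(expanded))
    # Terminal "e" rule: try each spelling with the final "e" dropped or added.
    for variant in list(variants):
        if variant.endswith("e"):
            variants.add(variant[:-1])
        else:
            variants.add(variant + "e")
    return variants


def candidate_cognates(english_words, french_words):
    """Intersect the English lexicon with the transformed French lexicon."""
    english = {w.lower() for w in english_words if len(w) >= MIN_LENGTH}
    pairs = set()
    for french_word in french_words:
        for variant in french_variants(french_word):
            if variant in english:
                pairs.add((variant, french_word))
    return pairs
```

Calling `candidate_cognates(english_lexicon, french_lexicon)` returns a set of `(english, french)` pairs. Under these assumptions a short pair such as "isle"/"île" is dropped by the length cutoff, which trades some recall for fewer accidental matches between short words.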
Similar Resources
Emergency medicine, disease surveillance, and informatics
Traditional handwriting recognition algorithms rely heavily on small lexicons and clean word images. Unfortunately, emergency medical documents do not satisfy either of these conditions. This paper describes a strategy whereby given an image representing a noisy handwritten word from a medical document, and a large lexicon consisting of English, medical and pharmacological words, symbols, abbre...
LIHLA: A lexical aligner based on language-independent heuristics
Alignment of words and multiword units plays an important role in many natural language processing applications, such as example-based machine translation, transfer rule learning for machine translation, bilingual lexicography, word sense disambiguation, etc. In this paper we describe LIHLA, a lexical aligner which uses bilingual probabilistic lexicons generated by a freely available set of too...
The Reconstruction Engine: A Computer Implementation of the Comparative Method
We describe the implementation of a computer program, the Reconstruction Engine (RE), which models the comparative method for establishing genetic affiliation among a group of languages. The program is a research tool designed to aid the linguist in evaluating specific hypotheses, by calculating the consequences of a set of postulated sound changes (proposed by the linguist) on complete lexicon...
Extracting Translation Lexicons from Bilingual Corpora: Application to South-Slavonic Languages
The paper presents a novel approach for automatic translation lexicon extraction from a parallel sentence-aligned corpus. This is a five-step process, which includes cognate extraction, word alignment, phrase extraction, statistical phrase filtering, and linguistic phrase filtering. Unlike other approaches whose objective is to extract word or phrase pairs to be used in machine translation, we ...
Initial Results in the Development of SCAN: A Swedish Clinical Abbreviation Normalizer
Abbreviations are common in clinical documentation, as this type of text is written under time-pressure and serves mostly for internal communication. This study attempts to apply and extend existing rule-based algorithms that have been developed for English and Swedish abbreviation detection, in order to create an abbreviation detection algorithm for Swedish clinical texts that can identify and...